Sketch-Based Estimation of Subpopulation-Weight

Authors

  • Edith Cohen
  • Haim Kaplan
Abstract

Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records’ attributes. Bottom-k sketches are a powerful summarization format for weighted items that includes priority sampling [18] (pri) and the classic weighted sampling without replacement (ws). They can be computed efficiently for many representations of the data, including distributed databases and data streams. We derive novel unbiased estimators and efficient confidence bounds for subpopulation weight. Our estimators and bounds are tailored by distinguishing between applications (such as data streams) where the total weight of the sketched set can be computed by the summarization algorithm without significant use of additional resources, and applications (such as sketches of network neighborhoods) where this is not the case. Our rank conditioning (RC) estimator is applicable when the total weight is not provided. This estimator generalizes the known estimator for pri sketches [18], and its derivation is simpler. When the total weight is available, we suggest another estimator, the subset conditioning (SC) estimator, which is tighter. Our rigorous derivations, based on clever applications of the Horvitz-Thompson estimator (which is not directly applicable to bottom-k sketches), are complemented by efficient computational methods. Performance evaluation using a range of Pareto weight distributions demonstrates considerable benefits of the ws SC estimator on larger subpopulations (over all other estimators); of the ws RC estimator (over existing estimators for this basic sampling method); and of our confidence bounds (over all previous approaches). Overall, we significantly advance the state of the art in estimating subpopulation-weight queries.
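To make the abstract's setting concrete, here is a minimal Python sketch of a bottom-k sketch and a rank-conditioning style estimate of subpopulation weight. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the formulas, so the code assumes the standard definitions: for ws, each item draws an exponential rank with rate equal to its weight; for pri, the rank is u/w with u uniform on (0, 1]; the sketch keeps the k smallest-ranked items plus the (k+1)-th smallest rank; and each sampled item is scaled by the inverse of the probability that its rank falls below that threshold. All function and parameter names are hypothetical.

```python
import math
import random


def bottom_k_sketch(weights, k, family="ws", rng=None):
    """Return (sample, threshold) for a dict of positive item weights.

    sample holds the k items with the smallest ranks; threshold is the
    (k+1)-th smallest rank (infinity if there are at most k items).
    Rank distributions are assumed, not taken from the abstract.
    """
    rng = rng or random.Random()
    ranked = []
    for key, w in weights.items():
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        if family == "ws":
            rank = -math.log(u) / w  # exponential with rate w
        elif family == "pri":
            rank = u / w  # priority-sampling rank
        else:
            raise ValueError("family must be 'ws' or 'pri'")
        ranked.append((rank, key, w))
    ranked.sort(key=lambda t: t[0])
    sample = {key: w for _, key, w in ranked[:k]}
    threshold = ranked[k][0] if len(ranked) > k else math.inf
    return sample, threshold


def inclusion_probability(w, threshold, family):
    """Probability that an item of weight w draws a rank below threshold."""
    if family == "ws":
        return 1.0 - math.exp(-w * threshold)
    return min(1.0, w * threshold)


def rc_estimate(sample, threshold, predicate, family="ws"):
    """Sum inverse-probability adjusted weights of sampled items matching predicate."""
    total = 0.0
    for key, w in sample.items():
        if predicate(key):
            total += w / inclusion_probability(w, threshold, family)
    return total


if __name__ == "__main__":
    weights = {i: float(i + 1) for i in range(20)}
    sample, threshold = bottom_k_sketch(weights, k=5, rng=random.Random(7))
    print(rc_estimate(sample, threshold, lambda key: key % 2 == 0))
```

Under these assumptions, the pri adjusted weight reduces to max(w, 1/threshold), the familiar priority-sampling form. A single estimate can be far from the true subpopulation weight; only the average over many independent sketches is expected to approach it. The SC estimator and the confidence bounds mentioned in the abstract are not covered by this sketch.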


Similar resources

Estimation of Subpopulation Parameters in One-stage Cluster Sampling Design

Sometimes, in order to estimate population parameters such as mean and total values, we draw a random sample by the cluster sampling method, and after sampling is complete, we are interested in using the same sample to estimate the desired parameters in a subset of the population, which is called a subpopulation. In this paper, we try to estimate subpopulation parameters in different cases when one-st...


Finding Heavily-Weighted Features with the Weight-Median Sketch

We introduce the Weight-Median Sketch, a sub-linear space data structure that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual infor...


CoMMEDIA: Separating Scaramouche from Harlequin to Accurately Estimate Items Frequency in Distributed Data Streams

In this paper, we investigate the problem of estimating the number of times that data items recur in very large distributed data streams. We present an alternative approach to the well-known CountMin Sketch in order to reduce the impact of collisions on the accuracy of the estimation. We propose to decrease, for each concerned item, the over-estimation that results from these collisions. Our sk...


A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation

Recommender systems use information retrieval and machine learning techniques to filter information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative-filtering-based recommender systems. In order to improve the accuracy of traditional user-based collaborative filtering techniques under the new-user cold-start problem a...


Offline Sketch Parsing via Shapeness Estimation

In this work, we target the problem of offline sketch parsing, in which the temporal order of strokes is unavailable. It is more challenging than most existing work, which usually leverages temporal information to reduce the search space. Unlike traditional approaches, in which thousands of candidate groups are selected for recognition, we propose the idea of shapeness estima...



Journal title:
  • CoRR

Volume abs/0802.3448

Publication date 2008